ScreenSpot-Pro

A GUI grounding benchmark for professional high-resolution computer use — testing whether AI can locate tiny UI elements across 23 applications and 5 industries

Published

September 11, 2025

Keywords: ScreenSpot-Pro, GUI grounding, GUI agent, screen understanding, UI element localization, high-resolution, professional software, multimodal LLM, computer use, visual grounding, Photoshop, AutoCAD, VSCode, MLLM benchmark, ScreenSeekeR

Introduction

GUI agents — AI systems that can operate computer interfaces on behalf of users — represent one of the most ambitious frontiers in AI. But while models have made progress on simple tasks like web browsing and mobile navigation, they collapse on professional software. The dense toolbars, tiny icons, and high-resolution multi-panel layouts of applications like Photoshop, AutoCAD, MATLAB, and Visual Studio Code remain far beyond their reach.

ScreenSpot-Pro quantifies this gap. It is a GUI grounding benchmark featuring 1,581 expert-annotated tasks across 23 professional applications, 5 industries, and 3 operating systems — all captured at authentic high resolutions. The challenge: given a natural language instruction and a full-screen screenshot, locate the exact UI element to click. Targets occupy only 0.07% of the screen area on average — 29× smaller than the original ScreenSpot benchmark.

“Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%.” — ScreenSpot-Pro Paper

graph LR
    A["ScreenSpot<br/>Cropped screenshots<br/>Target: 2.01% of image"] --> B["Too easy<br/>for frontier models"]
    B --> C["ScreenSpot-Pro<br/>Full-screen, high-res<br/>Target: 0.07% of image"]
    C --> D["Tests real-world<br/>professional GUI<br/>grounding"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is ScreenSpot-Pro?

ScreenSpot-Pro is a benchmark that evaluates whether multimodal large language models (MLLMs) can ground natural language instructions to precise UI element locations in high-resolution professional screenshots. Unlike prior benchmarks that used cropped or simplified screenshots, ScreenSpot-Pro uses full, unmodified screen captures from real expert workflows.

Key Characteristics

Feature Details
Total tasks 1,581 instructions (each in a unique screenshot)
Applications 23 across 5 professional industries + OS commons
Operating systems Windows, macOS, Linux
Resolution 1080p (1920×1080) and above, including dual-monitor setups
Target size 0.07% of image area on average (29× smaller than ScreenSpot)
Element types Text (62.6%) and Icons (37.4%)
Annotation Expert users with 5+ years experience; dual-reviewer quality control
Multilingual English + Chinese instructions for all tasks
License CC BY 4.0

Applications and Industries

ScreenSpot-Pro covers a uniquely diverse range of professional software:

graph TD
    SSP["ScreenSpot-Pro<br/>23 Applications"] --> DEV["Development<br/>& Programming"]
    SSP --> CRE["Creative<br/>Software"]
    SSP --> CAD["CAD &<br/>Engineering"]
    SSP --> SCI["Scientific &<br/>Analytical"]
    SSP --> OFF["Office<br/>Suite"]
    SSP --> OS["Operating System<br/>Commons"]

    DEV --> D1["VSCode · PyCharm<br/>Android Studio<br/>Quartus · VMware"]
    CRE --> C1["Photoshop · Premiere<br/>Illustrator · Blender<br/>FruitLoops · Unreal Engine<br/>DaVinci Resolve"]
    CAD --> CA1["AutoCAD · SolidWorks<br/>Inventor · Vivado"]
    SCI --> S1["MATLAB · Origin<br/>Stata · EViews"]
    OFF --> O1["Word · PowerPoint<br/>Excel"]
    OS --> OS1["Windows 11<br/>macOS · Linux"]

    style SSP fill:#e74c3c,color:#fff,stroke:#333
    style DEV fill:#3498db,color:#fff,stroke:#333
    style CRE fill:#27ae60,color:#fff,stroke:#333
    style CAD fill:#f39c12,color:#fff,stroke:#333
    style SCI fill:#8e44ad,color:#fff,stroke:#333
    style OFF fill:#e67e22,color:#fff,stroke:#333
    style OS fill:#6cc3d5,color:#fff,stroke:#333

What Makes It So Hard?

The core difficulty comes from three compounding factors:

  1. Professional complexity — applications like AutoCAD and MATLAB have hundreds of densely packed buttons, menus, and panels
  2. High resolution, tiny targets — at full-screen resolution, the target element averages only 0.07% of the image area
  3. Specialized icons — professional tools use domain-specific icons that are rarely seen in web training data
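To build intuition for how small these targets are, here is a quick back-of-envelope calculation, assuming a single 1920×1080 frame (the benchmark's minimum resolution):

```python
# How big is 0.07% of a 1920x1080 screenshot?
screen_px = 1920 * 1080             # 2,073,600 pixels in the frame
target_px = screen_px * 0.0007      # 0.07% of the frame
side = target_px ** 0.5             # side length if the target were square
print(round(target_px), "px^2, roughly a", round(side), "px square")
```

Roughly a 38-pixel square on a full-HD frame, and an even smaller relative share on the dual-monitor captures.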

In the original paper, GPT-4o scored only 0.8% on direct grounding — barely above random chance. Even the best specialist model (OS-Atlas-7B) achieved just 18.9%.

Who Built It?

ScreenSpot-Pro was developed by researchers at the National University of Singapore (NUS), East China Normal University, and Hong Kong Baptist University:

  • Kaixin Li, Zhiyong Huang, Tat-Seng Chua — National University of Singapore
  • Ziyang Meng — East China Normal University
  • Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma — Hong Kong Baptist University

The benchmark was published at the Workshop on Reasoning and Planning for Large Language Models (2025).

Resource Link
arXiv paper arxiv.org/abs/2504.07981
Leaderboard gui-agent.github.io/grounding-leaderboard
GitHub github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding

What Skills Does It Test?

ScreenSpot-Pro evaluates a very specific but critical capability: GUI visual grounding — the ability to translate a natural language instruction into a precise screen coordinate.

Capability What It Tests
High-resolution perception Processing screenshots at >1080p without losing detail
Tiny element localization Finding targets occupying 0.07% of the screen area
Professional domain knowledge Understanding industry-specific UI patterns (toolbars, panels, menus)
Icon comprehension Recognizing specialized icons (e.g., blend modes in Photoshop, circuit symbols in Vivado)
Cross-platform understanding Working across Windows, macOS, and Linux interfaces
Bilingual instruction following Grounding from both English and Chinese instructions

Example Tasks

Tasks range from straightforward to highly specialized:

  • “Refresh the file explorer” — VSCode (icon target)
  • “Unlink audio and video” — Premiere (text target in a context menu)
  • “Change the coordinate mode of the object” — Blender (icon target in a dense toolbar)
  • “Select the SM1.smf file in Quartus window” — Quartus (text target in a file browser)
  • “Disable masking” — Origin (tiny icon in a crowded toolbar)

Current Leaderboard

The leaderboard below shows model accuracy on ScreenSpot-Pro. The metric is click accuracy: whether the model’s predicted click point falls within the annotated ground-truth bounding box.

Source: ScreenSpot-Pro Leaderboard (consulted March 29, 2026). Last updated November 17, 2025. Results collected using greedy decoding; micro-average numbers reported.

Top 20 Models

Rank Model Overall (%)
1 KV-Ground-GuiOwl1.5-0315-8B-ZoomIn 80.5
2 Holo2-235B-A22B (Agentic) 78.5
3 MAI-UI-32B (MVP) 77.5
4 KV-Ground-GuiOwl1.5-4B-0228-ZoomIn 76.4
5 Holo2-30B-A3B (Agentic) 75.2
6 MVP_Qwen3VL-32B 74.1
7 MAI-UI-32B (Zoom In) 73.5
8 KV-Ground-GuiOwl1.5-0315-8B 73.2
9 MAI-UI-8B (Zoom In) 71.9
10 Holo2-8B (Agentic) 71.4
11 AdaZoom-GUI-Refine 71.3
12 Holo2-235B-A22B 70.6
13 KV-Ground-Qwen3VL-4B-ZoomIn 70.3
14 UI-Venus-1-5-30B-A3B 69.6
15 Holo2-4B (Agentic) 68.6
16 UI-Venus-1-5-8B 68.4
17 MAI-UI-32B 67.9
18 KV-Ground-GuiOwl1.5-0228-4B 67.0
19 Holo2-30B-A3B 66.1
20 MAI-UI-8B 65.7

Notable General-Purpose Models

Rank Model Overall (%)
41 Qwen2.5-VL-72B-Instruct 53.3
49 Qwen2.5-VL-32B-Instruct 48.0
56 UI-TARS-72B 38.1
70 GPT5-minimal (resized) 18.5
71 Claude (Computer Use) 17.1
83 GPT-4o 0.8

Key takeaways:

  • The best model (KV-Ground-GuiOwl1.5-8B with ZoomIn) achieves 80.5% — a massive leap from the original paper’s best of 18.9%, driven by visual search strategies that narrow the search area
  • Agentic / multi-round methods dominate the top ranks — models that zoom into candidate regions outperform single-pass approaches
  • General-purpose VLMs (GPT-4o, Claude Computer Use) still struggle severely on direct grounding in professional high-res environments
  • Even GPT-5 in minimal mode reaches only 18.5% when images are simply resized

Where to Explore the Benchmark

Dashboards and Resources

Resource Description Link
Official Leaderboard Live leaderboard with per-application breakdown across 23 software gui-agent.github.io/grounding-leaderboard
GitHub Repository Evaluation code, configs, and inference scripts github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding
Hugging Face Dataset The 1,581-task dataset with screenshots and annotations huggingface.co/datasets/likaixin/ScreenSpot-Pro
arXiv Paper Full technical paper with methodology and analysis arxiv.org/abs/2504.07981

Load the Dataset

from datasets import load_dataset

dataset = load_dataset("likaixin/ScreenSpot-Pro")
print(f"Number of tasks: {len(dataset['test'])}")
# Number of tasks: 1581
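Once loaded, you can recompute per-task statistics such as the relative target area. The field names below (`bbox` as `[left, top, right, bottom]` and `img_size` as `[width, height]`) are assumptions for illustration; check the dataset card on Hugging Face for the actual schema.

```python
# Sketch: a target's relative screen area from one annotation.
# Field layout is an assumed convention, not the confirmed schema.

def relative_area(bbox, img_size):
    """Fraction of the screenshot covered by the target box."""
    left, top, right, bottom = bbox
    width, height = img_size
    return (right - left) * (bottom - top) / (width * height)

# Example with made-up numbers: a 48x30 px target on a 1920x1080 screen.
print(f"{relative_area([100, 200, 148, 230], [1920, 1080]):.4%}")
```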

Understanding the Metrics

Click Accuracy

The primary metric is straightforward: given a model’s predicted click point (x, y), does it fall inside the annotated ground-truth bounding box? For models that output bounding boxes instead of points, the center of the predicted box is used.
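The metric above takes only a few lines to implement. The `(left, top, right, bottom)` box layout here is an assumed convention for the sketch, not the benchmark's confirmed format:

```python
def click_hit(pred_point, gt_bbox):
    """Click accuracy check: does the predicted (x, y) point fall
    inside the ground-truth box?"""
    x, y = pred_point
    left, top, right, bottom = gt_bbox
    return left <= x <= right and top <= y <= bottom

def box_center(pred_bbox):
    """For models that output a box instead of a point, score its center."""
    left, top, right, bottom = pred_bbox
    return ((left + right) / 2, (top + bottom) / 2)

print(click_hit((120, 45), (100, 30, 160, 60)))                    # True
print(click_hit(box_center((90, 20, 150, 70)), (100, 30, 160, 60)))  # True
```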

Per-Category Breakdown

The leaderboard reports accuracy per application, which reveals where models excel vs. struggle:

Category Challenge Level Why
Office Suite Moderate Familiar UI patterns, used in web training data
OS Commons Moderate Standard system interfaces
Development Hard Dense code editors, many small icons
Creative Very Hard Custom UIs, non-standard toolbars
CAD & Engineering Very Hard Extremely dense, specialized icons
Scientific Very Hard Domain-specific plots, menus with many entries

Text vs. Icon Targets

Icons are consistently harder to ground than text elements — models can leverage OCR capabilities for text but must rely on visual understanding for icons. In the original paper, OS-Atlas-7B scored 28.1% on text but only 4.0% on icons.

ScreenSeekeR: The Breakthrough Approach

The paper introduced ScreenSeekeR, an agentic visual search framework that dramatically improves grounding accuracy by narrowing the search area rather than trying to locate elements in the full high-resolution image. This insight — that reducing the search space matters more than increasing model size — proved foundational for the leaderboard leaders.

graph TD
    A["Full Screenshot<br/>High resolution"] --> B["Planner (GPT-4o)<br/>Predicts candidate regions"]
    B --> C["Score & Filter<br/>Candidate areas"]
    C --> D["Crop & Zoom<br/>Into top candidates"]
    D --> E["Grounder Model<br/>Locates target in<br/>simplified sub-image"]
    E --> F["Verify Result<br/>Planner checks<br/>correctness"]

    style A fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style B fill:#3498db,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#e67e22,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#8e44ad,color:#fff,stroke:#333

ScreenSeekeR boosted OS-Atlas-7B from 18.9% to 48.1%, a 2.5× improvement, without any additional training. This cascaded zoom-and-search approach inspired many of the top leaderboard methods (ZoomIn, MVP, Agentic variants).
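The pipeline above reduces to a coarse-to-fine search loop. In this minimal sketch the planner and grounder are deterministic stubs with hypothetical names; in the real framework both are MLLM calls (e.g. GPT-4o as the planner):

```python
# A minimal sketch of a cascaded zoom-and-search loop in the spirit of
# ScreenSeekeR. Function names and region logic are illustrative stubs,
# not the paper's API.

def plan_candidate_regions(image_size, instruction):
    """Stub planner: propose (x, y, w, h) regions likely to contain
    the target. Assumption: just split the screen into quadrants."""
    w, h = image_size
    return [(0, 0, w // 2, h // 2), (w // 2, 0, w // 2, h // 2),
            (0, h // 2, w // 2, h // 2), (w // 2, h // 2, w // 2, h // 2)]

def ground_in_region(region, instruction):
    """Stub grounder: return a click point in full-image coordinates,
    or None if the target is not found in this crop."""
    x, y, w, h = region
    if (x, y) == (0, 0):          # pretend the target sits top-left
        return (x + w // 4, y + h // 4)
    return None

def screen_seek(image_size, instruction):
    """Search candidate regions coarse-to-fine; return the first hit."""
    for region in plan_candidate_regions(image_size, instruction):
        point = ground_in_region(region, instruction)
        if point is not None:     # a verifier step would re-check here
            return point
    return None

print(screen_seek((1920, 1080), "Refresh the file explorer"))
```

The key design point is that the grounder only ever sees a cropped sub-image, so the tiny target occupies a much larger fraction of its input.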

Why ScreenSpot-Pro Matters

graph LR
    A["GUI agents need<br/>professional software<br/>capabilities"] --> B["Existing benchmarks<br/>too simple"]
    B --> C["ScreenSpot-Pro<br/>fills the gap"]
    C --> D["Better GUI agents<br/>for real productivity"]

    A2["High-res screens<br/>tiny UI targets"] --> B2["Models fail at<br/>precise localization"]
    B2 --> C
    C --> D2["Focus on<br/>visual search<br/>strategies"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Tests what matters for real productivity — Professional software is where GUI agents could deliver the most value, yet it’s the hardest environment
  2. Exposes the resolution bottleneck — Models that work on cropped screenshots fail catastrophically at full-screen resolution
  3. Validates visual search — The massive gap between single-pass (18.9%) and agentic zoom approaches (80.5%) proves that search strategy is critical
  4. Diverse and authentic — 23 applications across 5 industries, annotated by domain experts during real workflows
  5. Active community — 84 model submissions on the leaderboard and growing


Conclusion

ScreenSpot-Pro reveals a critical truth about AI GUI agents:

  • 1,581 expert-annotated tasks across 23 professional applications — from Photoshop and AutoCAD to MATLAB and Blender
  • Targets occupy only 0.07% of the screen — 29× smaller than the original ScreenSpot benchmark
  • General-purpose models like GPT-4o score < 1% on direct grounding in professional environments
  • Visual search strategies (zoom-and-crop) are the key breakthrough, with the best agentic methods reaching 80.5%
  • The gap between single-pass (18.9%) and multi-round approaches (80.5%) proves that the problem is not just about better models but about smarter search

As GUI agents evolve from web browsing toys into serious productivity tools, ScreenSpot-Pro provides the benchmark that measures whether they can handle the software that professionals actually use.
